Method and apparatus for improving thread posting efficiency in a multiprocessor data processing system

ABSTRACT

A computer implemented method, a data processing system, and computer usable program code for improving thread posting efficiency in a multiprocessor data processing system are provided. Aspects of the present invention first receive a set of threads from an application. The aspects of the present invention then group the set of threads with a plurality of processors based on a last execution of the set of threads on the plurality of processors to form a plurality of groups. The threads in each group in the plurality of groups are all last executed on a same processor. The aspects of the present invention then wake up the threads in the plurality of groups in any order.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to a multiprocessor data processing system. In particular, the present invention relates to improving thread posting efficiency in a multiprocessor data processing system. Still more particularly, the present invention relates to improving thread posting efficiency in a multiprocessor data processing system by awaking client threads based on a given central processing unit on which the client threads are expected to run.

2. Description of the Related Art

The UNIX operating system is a multi-user operating system supporting a hierarchical directory structure for the organization and maintenance of files. In contrast with a single operating system, UNIX is a class of similar operating systems. Dozens of different implementations of UNIX are present, such as Advanced Interactive Executive (AIX), a version of UNIX produced by International Business Machines Corporation. Each implementation is similar to use because each of these implementations provides a core set of basic UNIX commands.

The UNIX operating system is organized at three levels: the kernel, shell, and utilities. The kernel is the software that manages a user program's access to the system hardware and software resources, such as scheduling tasks, managing data/file access and storage, and enforcing security mechanisms. The shell presents each user with a prompt, interprets commands typed by a user, executes user commands, and supports a custom environment for each user. The utilities provide tools and applications that offer additional functionality to the operating system.

In the AIX operating system, users may put one or more threads to sleep by invoking a thread_wait command in the user mode and subsequently waking up each thread by invoking a thread_post user command. For large transaction centric applications that comprise thousands of threads, such as DB2 Universal Database and Oracle, thread posting efficiency becomes an issue. DB2 Universal Database is a product available from International Business Machines Corporation, and Oracle is a product available from Oracle Corporation.

In particular, these applications perform database logging on a single central processing unit (CPU) or a processor of a multiprocessor data processing system. However, if the multiprocessor data processing system has 128 processors all generating logging requests, database logging becomes a bottleneck since only one or a small number of processors is used as a logger. To alleviate this problem, improvements have been made that reduce database logging overhead by allowing the logger task to wake up all of its client threads in a single system call. This system call is known as thread_post_many.

The thread_post_many system call wakes up all of its client threads by issuing the equivalent of a thread_post system call to individual threads in a loop. However, the thread_post_many system call only solves part of the problem. Each update that threads running on various of the 128 processors try to perform requires logging. In addition, only one processor may be used as a logger. Therefore, a relatively large number of threads have to wait until the single logging thread completes previous logging. Although each wait only costs a few milliseconds, the total waiting time becomes a problem when there are 127 processors generating logging requests but only 1 processor handling them. Greater efficiency improvement is needed for the large number of computing threads that result from the increasing number of logging requests.

SUMMARY OF THE INVENTION

The aspects of the present invention provide a computer implemented method, a data processing system, and computer usable program code to improve thread posting efficiency in a multiprocessor data processing system. A set of threads is received from an application. The set of threads is grouped with a plurality of processors based on a last execution of the set of threads on the plurality of processors to form a plurality of groups. The threads in each group in the plurality of groups are all last executed on a same processor. The threads in the plurality of groups are awakened in any order.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a block diagram of a data processing system in which exemplary aspects of the present invention may be implemented;

FIG. 2 is a diagram illustrating interactions between aspects of the present invention in accordance with an illustrative embodiment of the present invention;

FIGS. 3A-3E are diagrams illustrating a new thread_post_many system call for waking up client threads based on a given central processing unit in accordance with an illustrative embodiment of the present invention; and

FIGS. 4A-4B are flowcharts of an exemplary process for improving thread posting efficiency by awaking client threads based on a given central processing unit on which the client threads are expected to run in accordance with an illustrative embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

With reference now to the figures, and in particular with reference to FIG. 1, a block diagram of a data processing system in which exemplary aspects of the present invention may be implemented is depicted. Data processing system 100 may be a symmetric multiprocessor (SMP) system including a plurality of processors 101, 102, 103, and 104 connected to system bus 106. For example, data processing system 100 may be an IBM eServer™, a product of International Business Machines Corporation in Armonk, N.Y., implemented as a server within a network. Alternatively, a single processor system may be employed. Also connected to system bus 106 is memory controller/cache 108, which provides an interface to a plurality of local memories 160-163. I/O bridge 110 is connected to system bus 106 and provides an interface to I/O bus 112. Memory controller/cache 108 and I/O bridge 110 may be integrated as depicted.

Data processing system 100 is a logical partitioned (LPAR) data processing system. Thus, data processing system 100 may have multiple heterogeneous operating systems (or multiple instances of a single operating system) running simultaneously. Each of these multiple operating systems may have any number of software programs executing within it. Data processing system 100 is logically partitioned such that different PCI I/O adapters 120-121, 128-129, and 136, graphics adapter 148, and hard disk adapter 149 may be assigned to different logical partitions. In this case, graphics adapter 148 provides a connection for a display device (not shown), while hard disk adapter 149 provides a connection to control hard disk 150.

Thus, for example, suppose data processing system 100 is divided into three logical partitions, P1, P2, and P3. Each of PCI I/O adapters 120-121, 128-129, 136, graphics adapter 148, hard disk adapter 149, each of processors 101-104, and memory from local memories 160-163 is assigned to one of the three partitions. In these examples, local memories 160-163 may take the form of dual in-line memory modules (DIMMs), for example. DIMMs are not normally assigned on a per DIMM basis to partitions. Instead, a partition will get a portion of the overall memory seen by the platform. For example, processor 101, some portion of memory from local memories 160-163, and PCI I/O adapters 120, 128, and 129 may be assigned to logical partition P1; processors 102-103, some portion of memory from local memories 160-163, and PCI I/O adapters 121 and 136 may be assigned to partition P2; and processor 104, some portion of memory from local memories 160-163, graphics adapter 148 and hard disk adapter 149 may be assigned to logical partition P3.

Each operating system executing within data processing system 100 is assigned to a different logical partition. Thus, each operating system executing within data processing system 100 may access only those I/O units that are within its logical partition. As an example, one instance of the Advanced Interactive Executive™ (AIX™) operating system may be executing within partition P1, a second instance (image) of the AIX™ operating system may be executing within partition P2, and a Windows™ operating system may be operating within logical partition P3. “Windows” is a product and trademark of Microsoft Corporation of Redmond, Wash.

Peripheral component interconnect (PCI) host bridge 114 connected to I/O bus 112 provides an interface to PCI local bus 115. A number of PCI input/output adapters 120-121 may be connected to PCI bus 115 through PCI-to-PCI bridge 116, PCI bus 118, PCI bus 119, I/O slot 170, and I/O slot 171. PCI-to-PCI bridge 116 provides an interface to PCI bus 118 and PCI bus 119. PCI I/O adapters 120 and 121 are placed into I/O slots 170 and 171, respectively. Typical PCI bus implementations will support between four and eight I/O adapters (i.e. expansion slots for add-in connectors). Each PCI I/O adapter 120-121 provides an interface between data processing system 100 and input/output devices such as, for example, other network computers that are clients to data processing system 100.

Additional PCI host bridge 122 may provide an interface for an additional PCI bus 123. PCI bus 123 is connected to a plurality of PCI I/O adapters 128-129. PCI I/O adapters 128-129 may be connected to PCI bus 123 through PCI-to-PCI bridge 124, PCI bus 126, PCI bus 127, I/O slot 172, and I/O slot 173. PCI-to-PCI bridge 124 provides an interface to PCI bus 126 and PCI bus 127. PCI I/O adapters 128-129 are placed into I/O slots 172 and 173, respectively. In this manner, additional I/O devices, such as, for example, modems or network adapters may be supported through each of PCI I/O adapters 128-129. In this manner, data processing system 100 allows connections to multiple network computers.

A memory mapped graphics adapter 148 inserted into I/O slot 174 may be connected to I/O bus 112 through PCI bus 144, PCI-to-PCI bridge 142, PCI bus 141 and PCI host bridge 140. Hard disk adapter 149 may be placed into I/O slot 175, which is connected to PCI bus 145. In turn, this bus is connected to PCI-to-PCI bridge 142, which is connected to PCI host bridge 140 by PCI bus 141.

PCI host bridge 130 provides an interface for PCI bus 131 to connect to I/O bus 112. PCI I/O adapter 136 is connected to I/O slot 176, which is connected to PCI-to-PCI bridge 132 by PCI bus 133. PCI-to-PCI bridge 132 is connected to PCI bus 131. This PCI bus also connects PCI host bridge 130 to service processor mailbox interface and ISA bus access pass-through 194 and PCI-to-PCI bridge 132. Service processor mailbox interface and ISA bus access pass-through 194 forwards PCI accesses destined to PCI/ISA bridge 193. NVRAM 192 is connected to ISA bus 196.

Service processor 135 is coupled to service processor mailbox interface and ISA bus access pass-through logic 194 through its local PCI bus 195. Service processor 135 is also connected to processors 101-104 via a plurality of JTAG/I²C busses 134. JTAG/I²C busses 134 are a combination of JTAG/scan busses (see IEEE 1149.1) and Philips I²C busses. However, alternatively, only Philips I²C busses or only JTAG/scan busses may replace JTAG/I²C busses 134. All SP-ATTN signals of processors 101, 102, 103, and 104 are connected together to an interrupt input signal of the service processor. Service processor 135 has its own local memory 191 and has access to OP-panel 190.

When data processing system 100 is initially powered up, service processor 135 uses JTAG/I²C busses 134 to interrogate the system (host) processors 101-104, memory controller/cache 108, and I/O bridge 110. At completion of this step, service processor 135 has an inventory and topology understanding of data processing system 100. Service processor 135 also executes Built-In-Self-Tests (BISTs), Basic Assurance Tests (BATs), and memory tests on all elements found by interrogating processors 101-104, memory controller/cache 108, and I/O bridge 110. Any error information for failures detected during the BISTs, BATs, and memory tests is gathered and reported by service processor 135.

If a meaningful/valid configuration of system resources is still possible after taking out the elements found to be faulty during the BISTs, BATs, and memory tests, then data processing system 100 is allowed to proceed to load executable code into local (host) memories 160-163. Service processor 135 then releases processors 101-104 for execution of the code loaded into local memory 160-163. While processors 101-104 are executing code from respective operating systems within data processing system 100, service processor 135 enters a mode of monitoring and reporting errors. The types of items monitored by service processor 135 include, for example, the cooling fan speed and operation, thermal sensors, power supply regulators, and recoverable and non-recoverable errors reported by processors 101-104, local memories 160-163, and I/O bridge 110.

Service processor 135 is responsible for saving and reporting error information related to all of the monitored items in data processing system 100. Service processor 135 also takes action based on the type of errors and defined thresholds. For example, service processor 135 may take note of excessive recoverable errors on a processor's cache memory and decide that this is predictive of a hard failure. Based on this determination, service processor 135 may mark that resource for deconfiguration during the current running session and future Initial Program Loads (IPLs). IPLs are also sometimes referred to as a “boot” or “bootstrap.”

Data processing system 100 may be implemented using various commercially available computer systems. For example, data processing system 100 may be implemented using IBM eServer™ iSeries® Model 840 system available from International Business Machines Corporation. Such a system may support logical partitioning using an OS/400 operating system, which is also available from International Business Machines Corporation.

Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 1 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention.

The processes of the present invention may be implemented within the kernel of an operating system, such as the AIX operating system. Large transaction centric applications, such as DB2 Universal Database and Oracle, may utilize the aspects of the present invention to improve thread posting efficiency. Aspects of the present invention improve thread posting efficiency in a multiprocessor data processing system, such as data processing system 100 in FIG. 1, by replacing the current thread_post_many system call with a new thread_post_many system call that wakes up client threads in a new sequence which is based on the given processor on which each of the client threads is individually expected to run. Instead of waking up client threads in the order that they are inserted by the application, the aspects of the present invention perform a heap sort on the client threads and link all the client threads to be awakened on a given processor together. However, sorting methods other than a heap sort that sort threads based on a given processor may be performed without departing from the spirit and scope of the present invention.

In one exemplary implementation, there may be a maximum of 512 threads to be awakened, which are scattered among all 128 processors. For each of the threads to be awakened, a lock that is specific to the processor, known as a run queue lock, needs to be acquired in order to serialize the wake ups before the lock is released. For example, when awakening one thread on one processor and five threads on another processor, an appropriate run queue lock has to be acquired six times. Instead of waking up threads in first-in-first-out (FIFO) fashion as currently performed by the database application, the aspect of the present invention sorts the list of client threads according to processors that the client threads are expected to run on.

As a result of the sort, cycle time can be saved with the aspects of the present invention. The cycle time is saved by setting how many threads are to be awakened per processor. When the number of threads is set to ten, for example, the run queue lock that is required to wake up the ten threads on processor 3 only needs to be acquired once before the lock is released. Otherwise, if the FIFO order as currently used by the application is followed, the run queue lock may need to be acquired up to ten times.
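The saving in lock acquisitions can be illustrated with a small user-space model. This is only a sketch: the function names are hypothetical, and it assumes that the per-thread scheme acquires the run queue lock once per wake up, while the grouped scheme holds each run queue lock for the whole group of threads on that processor.

```python
def acquisitions_per_thread(last_cpus):
    """One run queue lock acquisition per wake up (assumed model of the FIFO scheme)."""
    return len(last_cpus)


def acquisitions_grouped(last_cpus):
    """One acquisition per processor that has at least one thread to wake.

    Assumes the lock is held for the entire group, i.e. no per-acquisition limit.
    """
    return len(set(last_cpus))


# One thread last ran on processor 7, five threads last ran on processor 3.
example = [7, 3, 3, 3, 3, 3]
print(acquisitions_per_thread(example), acquisitions_grouped(example))
```

For the example in the text (one thread on one processor and five on another), the model gives six acquisitions for the per-thread scheme and two for the grouped scheme.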

As an alternative to cycle time savings by setting how many threads are to be awakened per processor, memory cache affinity benefits may be achieved with the aspects of the present invention. As threads that are targeted at the same processor are awakened, the internal structures of these threads are linked together on the run queue for that processor. Thus, data required to link these threads onto the list is in the cache as the same list of threads is referenced multiple times consecutively. In this way, memory cache affinity benefit is achieved.

As to the heap sort, since sorting 512 thread wake up requests may be time consuming, the aspects of the present invention make a shortcut. This shortcut runs through the list of client threads only once and links all threads that are expected to execute on the same processor, and hence were placed on the same run queue, together. For example, all threads that are executing on processor 1 are linked together, while all threads that are executing on processor 2 are linked separately. In this way, only one pass through the list of client threads completes the sort.

After sorting and linking the threads by processor, these threads are awakened in an order different from the order that the application implied by their position in its wakeup list, for example, a reverse order. In other words, when the threads are pushed onto 128 individual stacks, each thread is pulled off the stacks in the opposite direction to be awakened. This is known as a Last-In-First-Out (LIFO) approach. The LIFO approach gives memory cache benefits because the thread that was pushed onto a stack last, and is therefore pulled off first, most likely still has data in the cache.

For example, if ten threads recently ran on a given processor, the last thread that ran is the thread that should still have data inside the cache because the earlier threads in the list continued to wait for the logger. Therefore, even though cycle time savings may not be achieved should the function of batching the locking be turned off, memory cache benefits may still be achieved by waking up threads according to the LIFO approach. In addition to LIFO, different orders of waking up threads may be implemented without departing from the spirit and scope of the present invention. For example, a user may define a preferred order to wake up the threads.

Although the number of threads to be awakened per processor, which determines how long the lock is held at a time, may be adjusted, there are risks involved. In one exemplary implementation, if the number is adjusted to wake up all the threads on a given processor, the application may remain disabled for interrupts for a long time. For example, an interrupt may be delayed during 500 wake ups all directed to a single processor, which results in poor utilization of I/O devices.

On the other hand, if the number is adjusted to wake up only five or ten threads at a time, the aspects of the present invention may release the lock and enable for interrupts, disable for interrupts, and then re-acquire the lock. With this adjustment, while giving up cycle saving benefits, interrupts may be handled more responsively. In addition, between waking up threads on processor 3 and threads on processor 4, the aspects of the present invention are able to keep up with the I/O devices by releasing the lock for run queue 3, enabling for interrupts, disabling for interrupts, and then acquiring the lock for run queue 4. In a preferred embodiment, however, the number of threads to be awakened is adjusted to one thread at a time.

Turning now to FIG. 2, a diagram illustrating interactions between aspects of the present invention is depicted in accordance with an illustrative embodiment of the present invention. As shown in FIG. 2, application 202 executes within operating system 200 of a multiprocessor data processing system. An example of a multiprocessor data processing system is multiprocessor data processing system 100 in FIG. 1. Within application 202, a number of threads are executed to perform various functions. These threads are identified by their thread IDs. In this example, thread ID 1 to thread ID 9 are executed within application 202. In these examples, a total number of 512 threads and 128 processors may be present in a multiprocessor data processing system.

Thread ID 1 to thread ID 9 may last run on any processor within a multiprocessor data processing system. For example, thread ID 1, thread ID 4, and thread ID 7 last ran on processor 0. Thread ID 2, thread ID 3, and thread ID 5 last ran on processor 1. Thread ID 6, thread ID 8, and thread ID 9 last ran on processor 2. Examples of processor 0, processor 1, and processor 2 include processors 101, 102, and 103 in FIG. 1.

The aspects of the present invention provide a new call, thread_post_many system call 204, which takes the threads waiting within application 202 and sorts them based on which processor each of the threads last ran on. After the threads are sorted, thread_post_many system call 204 selects a subset of threads that last ran on the same processor in the multiprocessor data processing system and wakes up the subset of threads of each processor in any given order.

For example, thread_post_many system call 204 takes the threads that are executing within application 202 and sorts them based on which processor each thread last ran on. Thread_post_many system call 204 then selects a subset of threads that last ran on the same processor, for example, thread ID 1, thread ID 4, and thread ID 7, which last ran on processor 0. After the subset of threads is selected, thread_post_many system call 204 wakes up the subset of threads in any given order. For example, thread ID 4 may be awakened first, then thread ID 7, and then thread ID 1. However, in one embodiment, the threads will be awakened in a LIFO order of thread ID 7, thread ID 4, thread ID 1.
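The FIG. 2 example can be reproduced with a short user-space model. This is a sketch rather than the kernel implementation: the function name and the use of Python lists as per-processor stacks are illustrative choices, and it assumes the application passes the thread IDs in ascending order.

```python
# Thread ID -> processor it last ran on, taken from the FIG. 2 example.
LAST_CPU = {1: 0, 4: 0, 7: 0, 2: 1, 3: 1, 5: 1, 6: 2, 8: 2, 9: 2}


def lifo_wake_order(thread_ids, last_cpu):
    """Push each thread onto the stack of its last processor, then pop in LIFO order."""
    stacks = {}
    for tid in thread_ids:  # single pass over the application's list
        stacks.setdefault(last_cpu[tid], []).append(tid)
    # Reversing each stack gives the order in which threads are popped.
    return {cpu: stacks[cpu][::-1] for cpu in sorted(stacks)}


print(lifo_wake_order(sorted(LAST_CPU), LAST_CPU))
```

Under the stated assumption, the threads for processor 0 come out as 7, 4, 1, which matches the LIFO order given in the text.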

Turning now to FIGS. 3A-3E, diagrams illustrating a new thread_post_many system call for waking up client threads based on a given central processing unit are depicted in accordance with an illustrative embodiment of the present invention. New thread_post_many system call 300 may be implemented within an operating system executing within a data processing system, such as the AIX operating system kernel executing within data processing system 100 in FIG. 1.

As shown in FIG. 3A, in this example implementation, new thread_post_many system call 300 takes three input parameters: nthreads 302, tidp 304, and erridp 306. nthreads 302 represents the number of threads to wake up in an application. tidp 304 represents a thread identifier pointer for an array of thread identifiers, for example, an array of 173 thread IDs. erridp 306 represents an error pointer, pointing to where errors will be logged. In this example implementation, thread_post_many system call 300 may accommodate 32 bit and 64 bit user programs. Thus, the sizes of thread IDs, tidp 304, and error pointer, erridp 306, are scaled accordingly.

Turning now to FIG. 3B, a diagram illustrating thread_post_many system call 300 in continuation of FIG. 3A is depicted in accordance with an illustrative embodiment of the present invention. As shown in FIG. 3B, when thread_post_many system call 300 receives a user array, it allocates memory storage, ktidp 310, for the size of the thread IDs of the user array plus the size of a short integer times the number of user threads, in order to organize the threads by processor. For example, thread_post_many system call 300 may allocate 173 user thread IDs plus 173 short integers in order to create 128 [MAXCPU] linked lists.

Next, thread_post_many system call 300 identifies nexti 312, which is a pointer to link indices at an address following all of the thread IDs. In this case, the memory storage of a single request is partitioned into a big area for the thread IDs and a small area for subscript numbers. Next, thread_post_many system call 300 uses a kernel service, COPYIN 314, which goes to the user's memory and fetches what is in the user's memory, for example, tidp64 316, to the pinned memory that thread_post_many system call 300 allocated previously, ktidp 310. This enables thread_post_many system call 300 to avoid issues such as page faults later.
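The buffer layout described above, a block of thread IDs followed by a block of short link indices, can be checked with simple arithmetic. The sizes used here are assumptions for illustration only: 4-byte thread IDs for a 32 bit program, 8-byte thread IDs for a 64 bit program, and a 2-byte short integer. The function names are hypothetical.

```python
SHORT_SIZE = 2  # assumed size of a short integer, in bytes


def ktidp_allocation_bytes(nthreads, tid_size):
    """Total bytes: one thread ID plus one short link index per thread."""
    return nthreads * (tid_size + SHORT_SIZE)


def nexti_offset_bytes(nthreads, tid_size):
    """Byte offset of the link-index area, which follows all of the thread IDs."""
    return nthreads * tid_size


for tid_size in (4, 8):  # assumed 32 bit and 64 bit thread ID sizes
    print(tid_size,
          ktidp_allocation_bytes(173, tid_size),
          nexti_offset_bytes(173, tid_size))
```

With these assumed sizes, the 173-thread example needs 1038 bytes in 32 bit mode and 1730 bytes in 64 bit mode, with the link indices starting at byte 692 and byte 1384 respectively.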

Next, thread_post_many system call 300 includes for loop 318 which sets all list headers to −1 to initialize the processor subscripts. headi[i] 319 is defined as headi[MAXCPU] 308 in FIG. 3A. MAXCPU is the maximum number of processors that are supported on a given multiprocessor data processing system. Thus, for a 128-processor system, MAXCPU is 128 and headi[i] 319 represents 128 list headers. Index i represents a particular processor, for example, headi[3] represents a list header for processor 3. The value of ktidp[i] 307 in FIG. 3A, on the other hand, represents the thread ID that is to be awakened. Ktidp is short-hand for a ktidp32 or ktidp64, which reflects whether the application is running in a 32-bit or 64-bit mode. If headi[i] has a value of −1, there are no threads to be awakened on processor i. If headi[i] has value j, it represents the thread identified by ktidp[j], the j-th thread ID provided by the application's tidp array 304 in FIG. 3A.

headi[i] 319 gives a construct similar to a linked list that includes all the rest of the threads to be awakened on a given processor. Since headi[i] 319 is only a short integer, it is not enough to hold pointers, which are either 32 or 64 bits long depending on the kernel. Therefore, instead of using pointers, thread_post_many system call 300 uses processor subscripts to save memory. In other words, instead of using a linked list, which is a data structure in which each element contains an address of the next element, thread_post_many system call 300 uses a data structure in which each element contains a subscript number identifying the next element.

Turning now to FIG. 3C, a diagram illustrating thread_post_many system call 300 in continuation of FIG. 3B is depicted in accordance with an illustrative embodiment of the present invention. As shown in FIG. 3C, thread_post_many system call 300 makes a first pass through the list of client threads in the order the threads are passed up by the application and determines, for each thread in the list, whether it is valid to wake up the thread at this time and where the thread last ran. First, thread_post_many system call 300 includes for loop 320, which validates each thread to determine if the thread ID is valid and if permission exists to wake up the thread.

Next, for loop 320 determines for each thread in the list where the thread last ran. For example, if a thread with a thread ID ktidp[0] last ran on processor 3, thread_post_many system call 300 assigns headi[3] a value of 0 to represent that this thread is to be awakened on processor 3. As the remaining thread IDs, up to 512 in this example, are examined, another thread may also turn out to have last run on processor 3. In this case, since there is already a value in headi[3], thread_post_many system call 300 has to preserve the value that is currently in headi[3]. Thread_post_many system call 300 preserves the value as illustrated in statement 326 as described below.

Continuing with the previous example, if the thread with a thread ID ktidp[0] is passed up by the application as the last thread that ran on processor 3, thread_post_many system call 300 assigns headi[3] to 0. This step is illustrated by statement 324 in FIG. 3C. Next, thread_post_many system call 300 assigns the old value of headi[3], which is −1, to nexti[0]. This step is illustrated by statement 326 in FIG. 3C. As a result, nexti[0]=−1. Later, should the fifth element of the thread ID list that is passed up from the application also have an affinity of 3, thread_post_many system call 300 takes the 0 from headi[3] and pushes it into nexti[5], so that headi[3] can then refer to the fifth element. In this way, if thread ID 5 is to be awakened, nexti[5] is looked up by thread_post_many system call 300, such that thread 0 is to be awakened as well.
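The linking step can be modeled in user space with two integer arrays. This is a minimal sketch, not the code of FIG. 3C: the function name and the last_cpu_of input list are hypothetical, validation of thread IDs is omitted, and the two assignments are written so that the old head is saved before it is overwritten.

```python
MAXCPU = 128  # maximum number of processors, as in the text


def link_threads(last_cpu_of):
    """Build headi/nexti in one pass.

    last_cpu_of[j] is the processor on which the j-th thread in the
    application's list last ran (a hypothetical input for this sketch).
    """
    headi = [-1] * MAXCPU            # -1 means no threads for that processor
    nexti = [-1] * len(last_cpu_of)  # subscript of the next thread, or -1
    for j, cpu in enumerate(last_cpu_of):
        nexti[j] = headi[cpu]  # preserve the old head (the step the text labels statement 326)
        headi[cpu] = j         # thread j becomes the new head (statement 324)
    return headi, nexti


# Threads 0 and 5 both last ran on processor 3.
headi, nexti = link_threads([3, 1, 2, 1, 0, 3])
print(headi[3], nexti[5], nexti[0])
```

In this example headi[3] ends up as 5, nexti[5] as 0, and nexti[0] as −1, so following the subscripts from the head of processor 3 visits thread 5 and then thread 0.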

For loop 320 illustrates that threads may be linked together by subscripts instead of addresses. For example, thread_post_many system call 300 may pick up headi[i] and get the subscript of some thread in the user's original array and use that subscript to determine which thread is next to be awakened, up to 128 times. Thus, every thread that is flagged for processor 0, 1, 2 and so on may be awakened in the LIFO order.

Turning now to FIG. 3D, a diagram illustrating thread_post_many system call 300 in continuation of FIG. 3C is depicted in accordance with an illustrative embodiment of the present invention. As shown in FIG. 3D, after the client threads are sorted and linked, thread_post_many system call 300 includes for loop 330, which loops through each processor that is actually on the multiprocessor data processing system and wakes up the threads for that processor in a LIFO order.

For loop 330 first determines if the value of headi[i], which represents a thread ID that last ran on processor i, is −1. If so, there are no threads to wake up on processor i. This step is illustrated by statement 332 in FIG. 3D. However, if the value of headi[i] is not equal to −1, for loop 330 obtains the run queue for processor i. Then, do-while loop 334 within for loop 330 wakes up all the threads that are collected for each processor until there are no more threads to wake up for the processor, and thus, j=−1.
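The traversal performed by the two nested loops can be sketched as follows. This is an illustrative user-space model with a hypothetical function name: it records which thread subscripts would be awakened for each processor and leaves out locking and the actual wake up.

```python
def wake_order(headi, nexti):
    """Walk each processor's subscript chain and record (processor, subscript) pairs."""
    order = []
    for cpu, j in enumerate(headi):
        if j == -1:
            continue  # no threads to wake up on this processor
        # In the kernel, the run queue lock for this processor would be taken here.
        while j != -1:
            order.append((cpu, j))
            j = nexti[j]  # follow the subscript to the next thread
    return order


# Arrays for six threads spread over four processors (illustrative values).
headi = [4, 3, 2, 5]
nexti = [-1, -1, -1, 1, -1, 0]
print(wake_order(headi, nexti))
```

Within each processor the subscripts come out in reverse order of insertion, which is the LIFO order described in the text.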

As described above, the number of threads to be awakened per run queue lock acquisition per processor may be adjusted to avoid holding the lock for too long. Do-while loop 334 provides a variable k 336 to keep track of how many threads have been awakened per processor. NumberPosts 338 is a constant that is adjustable to represent the number of threads to be awakened per processor such that prolonged lock holding may be avoided. In a preferred embodiment, only 1 thread is to be awakened at one time. If the number of threads to be awakened per lock acquisition is exceeded, unlock_enable_mem 340 unlocks the run queue for processor i and enables interrupts, and then disables interrupts and relocks the run queue for processor i. In this way, the lock is held for the processing of at most NumberPosts threads at a time.
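The effect of the per-acquisition limit can be seen in a small event trace. This is a sketch under stated assumptions: the function and event names are hypothetical, locking is only recorded rather than performed, and the list of subscripts for one processor is assumed to be non-empty (empty lists are skipped before this point in the text).

```python
def wake_with_batches(thread_subscripts, number_posts):
    """Return the sequence of lock and wake events for one processor's threads."""
    events = ["disable_lock"]  # disable interrupts and lock the run queue
    k = 0                      # threads awakened since the lock was last taken
    for j in thread_subscripts:
        if k == number_posts:
            # Limit reached: briefly open a window for interrupts.
            events += ["unlock_enable", "disable_lock"]
            k = 0
        events.append(f"wake {j}")
        k += 1
    events.append("unlock_enable")
    return events


print(wake_with_batches([5, 3, 0], 2))
print(wake_with_batches([5, 3, 0], 1))
```

With a limit of two, three threads need two lock acquisitions. With a limit of one, as in the preferred embodiment, every wake up gets its own acquisition, which trades cycle savings for interrupt responsiveness.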

For the threads that are awakened, the thread IDs are converted into internal structure pointers. This step is illustrated by statement 342 in FIG. 3D. In order to wake up the threads, not only does the run queue need to be locked, but each thread also has to be locked. When obtaining a lock for the thread, a locking hierarchy is involved. A locking hierarchy is a hierarchy that governs the order in which locks may be obtained without running into deadlocks. Deadlock occurs when two processes are each waiting for the other to complete before proceeding, which results in both processes hanging. The locking hierarchy requires locking the thread prior to locking the run queue. However, aspects of the present invention lock the run queue in order to batch the threads onto the run queue before locking the threads, and thus, are subject to deadlock.

In order to avoid deadlock, do-while loop 334 includes simple_lock_try 344, which acquires the thread lock if it is available. If the thread lock is not available, instead of waiting and spinning, the lock request fails and an error is returned to the caller. In that case, in order to avoid deadlock, the run queue lock is unlocked 346, which occasionally gives up the cycle savings, and then the thread is locked 348. Thereafter, the run queue is locked again 350. In this way, thread and run queue locking may be performed in a safe order.
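The try-then-fall-back sequence above may be illustrated with the following sketch. The spin lock type and the functions spin_try, spin_acquire, spin_release, and lock_thread_safely are hypothetical stand-ins built on C11 atomics; they are not the kernel services named in FIG. 3D.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal spin lock used only for this sketch. */
typedef struct { atomic_flag held; } spin_lock_t;

/* Take the lock only if it is free; never waits. */
static bool spin_try(spin_lock_t *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

/* Spin until the lock is obtained. */
static void spin_acquire(spin_lock_t *l)
{
    while (atomic_flag_test_and_set(&l->held)) {
        /* busy-wait */
    }
}

static void spin_release(spin_lock_t *l)
{
    atomic_flag_clear(&l->held);
}

/* Called with the run queue lock held. Returns with both locks held.
 * The return value is true when the thread lock was obtained without
 * delay and false when the fallback path had to be used. */
static bool lock_thread_safely(spin_lock_t *run_queue, spin_lock_t *thread)
{
    if (spin_try(thread))
        return true;              /* fast path: out-of-order acquisition succeeded */

    /* Slow path: restore the hierarchy (thread first, then run queue). */
    spin_release(run_queue);
    spin_acquire(thread);
    spin_acquire(run_queue);
    return false;
}
```

Because the fast path never waits while holding the run queue lock, it cannot take part in a deadlock cycle; only the slow path waits, and it does so in hierarchy order.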

Furthermore, do-while loop 334 also includes an internal service, et_post_rc 352, which provides the ability to wake up the thread on a run queue other than the one that is locked. In most cases, a thread is awakened under the assumption that the thread that is unlocked remains on the processor that it last ran on. Thus, the thread is most likely bound to where it last ran. However, though rare, if a third party thread binds that thread to be run on a different processor, et_post_rc 352, which has the thread lock, detects that the run queue lock obtained was for a wrong run queue. In turn, et_post_rc 352 unlocks the wrong run queue, locks the correct run queue, wakes up the thread, unlocks the correct run queue, and relocks the wrong run queue. This is known as hidden error recovery. After the thread is awakened, it is then unlocked 353.
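The hidden error recovery sequence may be sketched as follows, under stated assumptions. The structure sketch_thread, the function post_with_recovery, the spin lock helpers, and the two-processor configuration are all hypothetical; the thread lock that the caller is described as holding is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS 2  /* hypothetical processor count for this sketch */

typedef struct { atomic_flag held; } spin_lock_t;

static bool spin_try(spin_lock_t *l)     { return !atomic_flag_test_and_set(&l->held); }
static void spin_acquire(spin_lock_t *l) { while (atomic_flag_test_and_set(&l->held)) { } }
static void spin_release(spin_lock_t *l) { atomic_flag_clear(&l->held); }

/* Hypothetical thread record. */
struct sketch_thread {
    int  cpu;    /* run queue the thread is currently bound to */
    bool awake;  /* set when the thread has been awakened */
};

/* Wake t while the caller holds run_queues[locked_cpu]. If t is bound to a
 * different run queue, swap locks, wake it there, and swap back, so the
 * caller still holds run_queues[locked_cpu] on return. Returns true when
 * the recovery path was taken. */
static bool post_with_recovery(spin_lock_t run_queues[NCPUS],
                               int locked_cpu, struct sketch_thread *t)
{
    assert(t->cpu >= 0 && t->cpu < NCPUS);

    if (t->cpu == locked_cpu) {
        t->awake = true;                        /* common case */
        return false;
    }

    spin_release(&run_queues[locked_cpu]);      /* unlock the wrong run queue */
    spin_acquire(&run_queues[t->cpu]);          /* lock the correct run queue */
    t->awake = true;                            /* wake up the thread */
    spin_release(&run_queues[t->cpu]);          /* unlock the correct run queue */
    spin_acquire(&run_queues[locked_cpu]);      /* relock the wrong run queue */
    return true;
}
```

The caller's loop is unaffected by the recovery: it continues to hold the run queue lock it started with and proceeds to the next thread in that processor's list.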

Turning now to FIG. 3E, a diagram illustrating thread_post_many system call 300 in continuation of FIG. 3D is depicted in accordance with an illustrative embodiment of the present invention. As shown in FIG. 3E, if a failure occurs during the wake ups, it is most likely an application failure, because the thread is locked when thread_post_many system call 300 tries to wake it up. In case a failing thread exists, the failing thread ID is copied out from the kernel to the user 354. Finally, the memory that was allocated previously is deallocated 356.

Turning now to FIG. 4A, a flowchart of an exemplary process for improving thread posting efficiency by awaking client threads based on a given central processing unit on which the client threads are expected to run is depicted in accordance with an illustrative embodiment. This exemplary process may be implemented within a kernel of an operating system, such as the AIX kernel. The process begins when thread_post_many system call receives a user array of thread IDs (step 402). Next, thread_post_many system call allocates memory in order to organize the threads by each processor (step 404).

Thread_post_many system call then identifies the nexti address, which follows all the thread IDs in the memory (step 406), and fetches from the user memory to the allocated memory all information about the threads (step 408). A for loop within thread_post_many system call then initializes all the processor subscript list headers by setting the values to −1 (step 410). Next, a first for loop starts and obtains information about each thread that is to be awakened in the list (step 412).

Then, the next thread is obtained from the list (step 415) and a determination is made as to whether the thread is valid (step 416). If the thread is invalid, the process continues to step 423. If the thread is valid, a determination is made as to whether the requester has permission (step 417). If the requester does not have permission, the first for loop terminates (step 424). However, if the requester has permission, the processor on which the thread last ran is determined (step 418).

Then, the for loop assigns the list header for the processor, which is where the thread last ran, with the thread ID of the thread that is to be awakened (step 420). In addition, the old value of the list header is assigned to nexti for the thread (step 422) in order to preserve it. The old value also links the threads to be awakened on a given processor together. At step 423, a determination is made as to whether additional threads are present. If additional threads are present, the process returns to step 415 to obtain the next thread. Otherwise, the first for loop then terminates (step 424) and the process terminates thereafter.
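Steps 410 through 422 may be illustrated by the following sketch, which is offered only as an example under stated assumptions. The function name group_by_last_cpu, the last_cpu input array, and the MAX_CPUS value are hypothetical; the validity and permission checks of steps 416 and 417 are omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS 4      /* hypothetical processor count for this sketch */
#define NO_THREAD (-1)  /* empty-list marker, matching the -1 in the description */

/* Group thread subscripts by the processor each thread last ran on.
 * last_cpu[t] is the processor on which thread t last ran (assumed to be
 * supplied by the caller). On return, headi[c] holds the subscript of the
 * thread most recently linked for processor c, and nexti[t] holds the
 * subscript that was at the head before t was linked. */
static void group_by_last_cpu(const int *last_cpu, size_t count,
                              int headi[MAX_CPUS], int *nexti)
{
    for (int c = 0; c < MAX_CPUS; c++)
        headi[c] = NO_THREAD;          /* step 410: empty list per processor */

    for (size_t t = 0; t < count; t++) {
        int c = last_cpu[t];           /* step 418: processor of last run */
        assert(c >= 0 && c < MAX_CPUS);

        /* The old header value is saved before it is overwritten, so the
         * assignment of step 422 appears first in code. */
        nexti[t] = headi[c];           /* step 422: preserve the old head */
        headi[c] = (int)t;             /* step 420: thread t becomes the head */
    }
}
```

Because each thread is pushed at the head of its processor's list, a later traversal from headi[c] through nexti visits the threads in the reverse of their order in the input array, which gives the LIFO behavior described above.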

Turning now to FIG. 4B, a flowchart of an exemplary process in continuation of FIG. 4A for improving thread posting efficiency by awaking client threads based on a given central processing unit on which the client threads are expected to run is depicted in accordance with an illustrative embodiment. This process continues from step 424 in FIG. 4A and begins when a second for loop starts (step 430) and retrieves the next processor (step 431). The second for loop then makes a determination as to whether a thread exists by examining the value of the list header of the processor (step 432). If a thread does not exist, a determination is made as to whether additional processors are present (step 448). If additional processors are present, the process returns to step 431 to retrieve the next processor. If additional processors are not present, the process continues to step 449.

Turning back to step 432, if a thread exists, the run queue of the processor is locked (step 434) and a determination is made as to whether a maximum number of consecutive threads has been awakened (step 438). The maximum number of consecutive threads to be awakened at one time is a constant known as numberPosts. If the maximum number of consecutive threads has been awakened, the run queue of the processor is unlocked and relocked (step 452) and the process continues to step 440. If the maximum number of consecutive threads has not been awakened at step 438, the process proceeds to make a determination as to whether the thread can be locked without delay (step 440). If the thread is locked without delay, the thread is awakened and any error is noted (step 442). However, if the thread could not be locked without delay, the run queue is unlocked, the thread is locked, and the run queue is relocked (step 454). The process then continues to step 442, where the thread is awakened and any error is noted.

When waking up the thread at step 442, a hidden error recovery is performed by an et_post_rc routine, which makes a determination as to whether the thread is on a wrong run queue, which is different from the run queue that is locked in step 434. If the thread is on a wrong run queue, et_post_rc wakes up the thread by unlocking the wrong run queue, locking the correct run queue, waking up the thread, unlocking the correct run queue, and relocking the wrong run queue.

After the thread is awakened at step 442, the thread is unlocked (step 444). A determination is then made as to whether additional threads are present (step 446). If additional threads are not present, the run queue of the processor is unlocked (step 456) and the process returns to step 448 to proceed to the next processor. If additional threads are present, the process returns to step 436 to retrieve the next thread for this processor. If additional processors are absent in step 448, the second for loop then terminates (step 449). Any error that is noted and the thread ID are copied out (step 450) and the allocated memory is freed (step 451). The process terminates thereafter.

In summary, with the aspects of the present invention, the impact on the processor's memory cache may be minimized because neighboring threads are linked together in a two-way chain when awakened, which affects the cache that is involved in the neighboring threads when each thread is awakened. In addition, as a result of the heap sort, the last thread in the list becomes the first thread that is awakened. In other words, the thread that last ran in the processor becomes the first thread to be awakened. This Last-In-First-Out (LIFO) tendency benefits memory caches, since the last thread that goes to sleep on a processor is most likely the thread that still has data residing in the processor memory cache.

The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and digital video disc (DVD).

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.

The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

1. A computer implemented method for thread posting efficiency in a multiprocessor data processing system, the computer implemented method comprising: receiving a set of threads from an application; grouping the set of threads with a plurality of processors based on a last execution of the set of threads on the plurality of processors to form a plurality of groups, wherein threads in each group in the plurality of groups are all last executed on a same processor; and waking up the threads in a group in the plurality of groups prior to waking up the threads in another group in the plurality of groups.
2. The computer implemented method of claim 1, wherein the grouping step comprises: sorting each thread in the set of threads based on a processor in the plurality of processors that a particular thread last ran on to form a sorted set of threads; and selecting, for each processor in the plurality of processors, a subset of threads from the set of sorted threads, wherein the subset of threads forms a group in the plurality of groups and is last executed on a particular processor.
3. The computer implemented method of claim 1, wherein the waking up step comprises: waking up a single thread within each group in the plurality of groups prior to waking up another thread in the each group of the plurality of groups.
4. The computer implemented method of claim 1, wherein the waking up step comprises: waking up all of the threads in the group in the plurality of groups prior to waking up all of the threads in the another group in the plurality of groups.
5. The computer implemented method of claim 2, wherein the sorting step comprises: allocating memory storage for a list header of a processor; identifying link indices, wherein the link indices link the subset of threads for each processor; and initializing a value of the list header.
6. The computer implemented method of claim 5, wherein the sorting step comprises: determining, for each thread in the set of threads, if the thread is valid and permission exists to wake up the thread; if the thread is valid and permission exists to wake up the thread, determining a processor on which the thread is last ran; and assigning a thread identifier of the thread to the list header for the processor on which the thread is last ran.
7. The computer implemented method of claim 6, wherein the assigning step comprises: preserving a current value of the list header for the processor by assigning the current value to a link index in the link indices associated with the thread.
8. The computer implemented method of claim 7, wherein the waking up step comprises: determining, for each processor in the plurality of processors, if a thread is present based on the list header; if a thread is present, locking a run queue of the processor; retrieving the thread; and determining if a maximum number of threads for the processor is awakened.
9. The computer implemented method of claim 8, wherein the waking up step further comprises: if a maximum number of threads for the processor is awakened, unlocking the run queue of the processor; and relocking the run queue of the processor.
10. The computer implemented method of claim 9, wherein the waking up step further comprises: if a maximum number of threads for the processor is not awakened, determining if the thread can be locked without delay; if the thread cannot be locked without delay, unlocking the run queue of the processor; locking the thread with delay; relocking the run queue of the processor; and waking up the thread.
11. The computer implemented method of claim 10, wherein the waking up step further comprises: if the thread is locked without delay, waking up the thread; determining if additional threads are present in the processor; if additional threads are absent in the processor, unlocking the run queue of the processor; and copying out an error if additional processors are absent in the plurality of processors and if one of the subset of threads cannot be awakened.
12. The computer implemented method of claim 11, wherein the waking up step further comprises: determining if the thread is on a wrong run queue; if the thread is on a wrong run queue, waking up the thread by unlocking the wrong run queue, locking a correct run queue, waking up the thread, unlocking the correct run queue, and relocking the wrong run queue.
13. The computer implemented method of claim 1, wherein the waking up step further comprises: waking up the threads in the group in the plurality of groups in a last-in-first-out order.
14. A data processing system for improving thread posting efficiency, the data processing system comprising: a bus, a storage device, wherein the storage device contains computer usable code; a communications unit connected to the bus; and a processing unit comprising a plurality of processors connected to the bus, wherein the processing unit executes the computer usable code to receive a set of threads from an application; group the set of threads with a plurality of processors based on a last execution of the set of threads on the plurality of processors to form a plurality of groups, wherein threads in each group in the plurality of groups are all last executed on a same processor; and wake up the threads in a group in the plurality of groups prior to waking up the threads in another group in the plurality of groups.
15. The data processing system of claim 14, wherein the processing unit, in executing the computer usable code to group the set of threads with a plurality of processors based on a last execution of the set of threads on the plurality of processors to form a plurality of groups, executes the computer usable code to sort each thread in the set of threads based on a processor in the plurality of processors that a particular thread last ran on to form a sorted set of threads; and select, for each processor in the plurality of processors, a subset of threads from the set of sorted threads, wherein the subset of threads forms a group in the plurality of groups and is last executed on a particular processor.
16. The data processing system of claim 14, wherein the processing unit, in executing the computer usable code to wake up the threads in a group in the plurality of groups prior to waking up the threads in another group in the plurality of groups, executes the computer usable code to wake up a single thread within each group in the plurality of groups prior to waking up another thread in the each group in the plurality of groups.
17. The data processing system of claim 14, wherein the processing unit, in executing the computer usable code to wake up the threads in a group in the plurality of groups prior to waking up the threads in another group in the plurality of groups, executes the computer usable code to wake up all of the threads in the group in the plurality of groups prior to waking up all of the threads in the another group in the plurality of groups.
18. A computer program product comprising: a computer usable medium having computer usable program code for improving thread posting efficiency in a multiprocessor data processing system, said computer program product including: computer usable program code for receiving a set of threads from an application; computer usable program code for grouping the set of threads with a plurality of processors based on a last execution of the set of threads on the plurality of processors to form a plurality of groups, wherein threads in each group in the plurality of groups are all last executed on a same processor; and computer usable program code for waking up the threads in a group in the plurality of groups prior to waking up the threads in another group in the plurality of groups.
19. The computer program product of claim 18, wherein the computer usable program code for grouping the set of threads with a plurality of processors based on a last execution of the set of threads on the plurality of processors to form a plurality of groups comprises: computer usable program code for sorting each thread in the set of threads based on a processor in the plurality of processors that a particular thread last ran on to form a sorted set of threads; and computer usable program code for selecting, for each processor in the plurality of processors, a subset of threads from the set of sorted threads, wherein the subset of threads forms a group in the plurality of groups and is last executed on a particular processor.
20. The computer program product of claim 19, wherein the computer usable program code for waking up the threads in a group in the plurality of groups prior to waking up the threads in another group in the plurality of groups comprises: computer usable program code for waking up a single thread within each group in the plurality of groups prior to waking up another thread in the each group of the plurality of groups.